Predicting Students’ Chance of Admission Using Beta Regression

Author

Marcus Chery and Keiron Green

Published

December 6, 2023

Website

Slides

Introduction

The literature review delves into the development and application of beta regression, a statistical method used for handling continuous bounded response variables typically constrained within the open interval (0, 1). This approach is notably significant in modeling proportions and percentages, offering a nuanced and interpretable framework for understanding the impact of predictor variables. The foundational work by (Ferrari and Cribari-Neto 2004) in introducing beta regression has been a pivotal point in statistical methodology, setting the stage for numerous advancements and applications in this field.

This report delves into the application of beta regression within the context of university admissions data analysis. Our objective is to provide a thorough understanding of beta regression, its practical implementation using the “betareg” package in R, and its real-world relevance through predicting a student’s chance of admission to university given certain factors. Throughout the report, we offer practical examples, code snippets, and graphs to facilitate hands-on learning. Additionally, we address the limitations of beta regression and discuss potential future research directions. The goal of this report is to equip readers with the knowledge and skills needed to effectively utilize beta regression in university admissions data.

Beta regression’s core premise involves assuming the dependent variable to follow a beta distribution. This assumption allows for a flexible yet robust framework to model the relationship between dependent and independent variables, particularly when dealing with data such as rates, proportions, and percentages. The method’s adaptability is further enhanced by using link functions—such as logit, probit, and log-log—facilitating the understanding of how predictor variables influence the response variable’s distribution, both in terms of its mean and variability.

The introduction of the “betareg” package for R by (Cribari-Neto and Zeileis 2010) marked a significant milestone, providing a practical and user-friendly tool for implementing beta regression models. This package has been instrumental in popularizing beta regression in data analysis, particularly in scenarios involving continuous bounded response variables.

Subsequent research has expanded upon the initial framework of beta regression. (Guolo and Varin 2014) introduced a model incorporating Gaussian copula to address serial dependence in time series data, broadening the scope of beta regression to new areas such as epidemiological trend analysis. (Patrícia L. Espinheira and Cribari-Neto 2008) contributed significantly by introducing innovative residuals based on Fisher’s scoring iterative algorithm, thereby enhancing model assessment capabilities and offering solutions for bias correction.

Further advancements in the methodology have been made over the years. (Simas, Barreto-Souza, and Rocha 2010) delved into the issue of bias in parameter estimation within beta regression models, providing strategies for bias correction. (Schmid 2013) developed the concept of “boosted beta regression”, catering to complex modeling situations involving multicollinearity, nonlinear relationships, and overdispersion. Their approach utilizes the gamboostLSS algorithm, enabling efficient estimation of beta regression models and demonstrating their efficacy through applications in ecological data.

(Abonazel et al. 2022) focused on improving estimation in the presence of multicollinearity, proposing the Dawoud–Kibria estimator as an alternative to traditional maximum likelihood estimators. (Douma and Weedon 2019) highlighted the challenges and solutions for analyzing proportional data in ecological and evolutionary studies, advocating for beta and Dirichlet regression techniques. (Couri et al. 2022) evaluated various algorithms for estimating parameters in beta regression models, offering insights into overcoming computational challenges in parameter estimation.

Of particular note is the work of (Ospina and Ferrari 2012), who introduced zero-or-one inflated beta regression models. This approach is specifically tailored for datasets containing continuous proportions that are inflated with zeros or ones, offering a robust solution for such mixed continuous-discrete distributions.

In synthesizing these developments, the review underscores the versatility and continued evolution of beta regression as a statistical tool. Its relevance across various disciplines and data types is evident, showcasing its utility in real-world scenarios. This paper specifically applies beta regression to predict students’ chances of admission, demonstrating the methodology’s applicability in educational data analysis and offering insights into factors influencing admission decisions.

Methods

Our objective is to investigate the relationship between the given predictors and the response variable ‘chance_of_admit’ and to quantify their effects. Because the response is a proportion, linear regression is not appropriate: its predicted values can fall outside the interval (0, 1), whereas all proportional values lie within those bounds. The suitable approach is beta regression, which we implement to predict a student’s indicated chance of admission from the available predictors.

After the data has been cleaned appropriately, the beta regression model is fitted. Before delving into the implementation, let’s briefly review the mathematics behind beta regression. Beta regression is commonly based on the logit link function, which maps probabilities in the interval (0, 1) onto the real line: \[ \operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right)\]
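As a quick illustration, the logit and its inverse can be sketched in a few lines of base R (the inverse is also available in R as plogis()):

```r
# logit maps a probability in (0, 1) onto the whole real line
logit <- function(p) log(p / (1 - p))

# its inverse (the logistic function) maps any real number back into (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

logit(0.5)              # 0: even odds sit at the center of the scale
inv_logit(logit(0.8))   # recovers 0.8
```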

Another component needed for modelling the proportion values is the beta density. The beta density is especially useful for modelling probabilities because its support is the interval (0, 1). When implementing beta regression, the response variable is assumed to follow a beta distribution, hence the name beta regression.

The beta density formula is below. \[ f(y; p, q) = \frac{\Gamma(p + q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1}(1-y)^{q-1}, \qquad 0 < y < 1, \] where \(p, q > 0\) and \(\Gamma(\cdot)\) is the gamma function. An alternative parameterization was proposed by (Ferrari and Cribari-Neto 2004) by setting \(\mu = p/(p+q)\) and \(\phi = p + q\):

\[ f(y;\mu,\phi)=\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1} \]

\[0<y<1\]

with \(0 < \mu < 1\) and \(\phi > 0\).

We denote \(y \sim B(\mu, \phi)\), where \(E(y) = \mu\) and \(\mathrm{VAR}(y) = \mu(1-\mu)/(1+\phi)\). The parameter \(\phi\) is known as the precision parameter: for a fixed value of \(\mu\), the variance of the response variable \(y\) decreases as \(\phi\) increases.
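This mean–precision parameterization can be checked by simulation: base R’s rbeta() takes the shape parameters, which here are \(\mu\phi\) and \((1-\mu)\phi\). A sketch with illustrative values of \(\mu\) and \(\phi\):

```r
set.seed(1)
mu  <- 0.7    # mean of the beta distribution (illustrative value)
phi <- 50     # precision parameter (illustrative value)

# under the Ferrari & Cribari-Neto parameterization:
#   shape1 = mu * phi, shape2 = (1 - mu) * phi
y <- rbeta(1e5, shape1 = mu * phi, shape2 = (1 - mu) * phi)

mean(y)                      # approximately mu
var(y)                       # approximately the theoretical variance
mu * (1 - mu) / (1 + phi)    # theoretical variance mu*(1-mu)/(1+phi)
```

Increasing phi while holding mu fixed shrinks the simulated variance, matching the precision interpretation above.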

Let \(y_1, \ldots, y_n\) be a random sample with \(y_i \sim B(\mu_i, \phi)\), \(i = 1, \ldots, n\). The beta regression model is defined as \[ g(\mu_i) = x_i^T\beta = \eta_i, \] where \(\beta = (\beta_1, \ldots, \beta_k)^T\) is a \(k \times 1\) vector of unknown regression parameters \((k < n)\), \(x_i = (x_{i1}, \ldots, x_{ik})^T\) is the vector of \(k\) regressors (independent variables or covariates), and \(\eta_i\) is a linear predictor, i.e., \(\eta_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\); usually \(x_{i1} = 1\) for all \(i\) so that the model has an intercept. Here \(g(\cdot): (0,1) \to \mathbb{R}\) is the link function.
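The pieces fit together as follows: the linear predictor \(\eta_i\) is mapped through the inverse link to obtain a mean \(\mu_i\) in (0, 1). A minimal sketch with hypothetical coefficient values (not estimates from the admissions data):

```r
# hypothetical coefficients and one covariate row (illustrative values only)
beta <- c(-1.0, 0.5, 0.3)   # beta_1 is the intercept
x_i  <- c(1, 2.0, -1.0)     # x_i1 = 1 supplies the intercept term

eta_i <- sum(x_i * beta)    # linear predictor: x_i^T beta
mu_i  <- plogis(eta_i)      # inverse logit; plogis(eta) = 1 / (1 + exp(-eta))

mu_i                        # always lies strictly between 0 and 1
```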

Data Exploration

This section aims to identify key variables for constructing a beta regression model to predict admission chances. Initially, we analyze the dataset to determine the correlation between various predictors and the chance of admission. Following this, we use these insights to develop and evaluate several models, focusing on their predictive accuracy and effectiveness. This process is critical for establishing a reliable model that accurately forecasts admission probabilities based on the identified significant predictors.

Data and Visualization

Data: University Admission Data

Attribute Information

Variable Parameter Range Description
GRE Scores gre_score 290 - 340 (340 scale) Quantifies a candidate’s performance on the Graduate Record Examination, with a maximum score of 340
TOEFL Scores toefl_score 92 - 120 (120 scale) Measures English language proficiency, scored out of a total of 120 points
University Rating university_rating 1 to 5, with 5 being the highest rating Rates universities on a scale from 1 to 5, indicating their overall quality and reputation
Statement of Purpose (SOP) Strength sop 1 to 5, with 5 being the highest rating Evaluates the strength and quality of a candidate’s SOP on a scale of 1 to 5
Letter of Recommendation (LOR) Strength lor 1 to 5, with 5 being the highest rating Evaluates the strength and quality of a candidate’s LOR on a scale of 1 to 5
Undergraduate GPA cgpa 6.8 - 9.92 (10.0 scale) Reflects a student’s academic performance in their undergraduate studies, scored on a 10-point scale
Research Experience research 0 or 1 Indicates whether a candidate has research experience (1) or not (0)
Chance of Admit chance_of_admit 0.34 - 0.97 (0 to 1 scale) Represents the likelihood of a student being admitted, expressed as a decimal between 0 and 1

Libraries

Load necessary packages for analysis and modeling.

Code
#install.packages("janitor")
#install.packages("caTools")
#Install required packages
#install.packages('caret')
library(caret)
library(reshape2)
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(betareg)
library(lmtest)
library(car)
library(rcompanion)
library(janitor)
library(here)
library(caTools)
library(pROC)

Loading Dataset

Code
admission <- read_csv("adm_data.csv")
#str(admission)

admission <- admission %>%
  clean_names()

head(admission,5)
# A tibble: 5 × 9
  serial_no gre_score toefl_score university_rating   sop   lor  cgpa research
      <dbl>     <dbl>       <dbl>             <dbl> <dbl> <dbl> <dbl>    <dbl>
1         1       337         118                 4   4.5   4.5  9.65        1
2         2       324         107                 4   4     4.5  8.87        1
3         3       316         104                 3   3     3.5  8           1
4         4       322         110                 3   3.5   2.5  8.67        1
5         5       314         103                 2   2     3    8.21        0
# ℹ 1 more variable: chance_of_admit <dbl>

Model Assumptions

Assumption Description Assumption Met
Beta Distribution of Dependent Variable The dependent variable is continuous and bounded between 0 and 1. Met
Independent Observations Observations are independent of each other. Assumed
Homoscedasticity of Error Terms The variance of the error terms is constant across levels of the independent variables. Assumed
Linearity of Predictors There is a linear relationship between the logit of the expected value and the predictors. Met
No Perfect Multicollinearity Predictors are not perfectly collinear. Met
No Outliers The data does not have influential outliers. Met
Code
hist(admission$chance_of_admit, breaks=20, main="Histogram of Chance of Admit", xlab="Chance of Admit")

The dependent variable is continuous and bounded between 0 and 1.

Code
pairs(~ chance_of_admit + gre_score + toefl_score + university_rating + sop + lor + cgpa + research, data = admission)

There is a linear relationship between the logit of the expected value and the predictors.

Code
library(car)
vif(lm(chance_of_admit ~ gre_score + toefl_score + university_rating + sop + lor + cgpa + research, data = admission))
        gre_score       toefl_score university_rating               sop 
         4.615516          4.288959          2.919606          3.075504 
              lor              cgpa          research 
         2.431258          5.207403          1.543312 

VIF scores do not exceed the chosen threshold of 10.
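For context, the VIF of a predictor is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the remaining ones. A self-contained sketch on simulated data (the report itself uses car::vif() on the admission model):

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)   # deliberately correlated with x1

# VIF for x1: regress it on the other predictor(s), then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1   # inflated above 1 by the correlation, but below the threshold of 10
```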

Code
boxplot(admission$chance_of_admit, main = "Boxplot of Chance of Admit", ylab = "Chance of Admit")

Code
par(mfrow = c(2, 4))  # Adjust layout to display multiple plots
boxplot(admission$gre_score, main = "GRE Score")
boxplot(admission$toefl_score, main = "TOEFL Score")
boxplot(admission$university_rating, main = "University Rating")
boxplot(admission$sop, main = "SOP")
boxplot(admission$lor, main = "LOR")
boxplot(admission$cgpa, main = "CGPA")
boxplot(admission$research, main = "Research")

The data does not have influential outliers.

Further EDA

Chance of Admit and TOEFL

Based on exploratory data analysis, higher TOEFL scores appear to be associated with a greater chance of admission; this chance is further augmented by higher university ratings, and research experience appears to be a strong factor in increasing both TOEFL scores and the likelihood of admission.

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit)) +
  geom_point(color = "#1f77b4", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(x = "TOEFL Score", y = "Chance of Admission", title = "Linear and Smooth Fit: Chance of Admission vs TOEFL Score") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "bottom")

The correlation coefficient of 0.791594 between TOEFL scores and the chance of admit suggests a strong positive relationship. This indicates that as TOEFL scores increase, the chance of admission tends to increase as well.

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_jitter(width = .2, size = I(3)) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) +
  labs(x = "TOEFL Score", y = "Chance of Admission", color = "University Rating", 
       title = "Chance of Admit by TOEFL Score per University Ranking") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) +
  geom_smooth(method = "lm", se = FALSE)

This trend indicates that applicants to higher-rated universities generally have a higher chance of admission. The standard deviation decreases as the university rating increases, suggesting more consistency in admission chances at higher-rated universities.

Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit, color = as.factor(research))) +
  geom_jitter(width = .2) +
  scale_color_manual(values = c("#270181", "coral")) +
  labs(x = "TOEFL Score", y = "Chance of Admission", color = "Research Experience", 
       title = "Chance of Admit by TOEFL Score per Research Experience") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

Applicants with research experience (Research = 1) have a higher average TOEFL score (approx. 110) compared to those without research experience (Research = 0), who have an average TOEFL score of about 104.

The mean chance of admission for applicants with research experience is significantly higher (approx. 0.796) than for those without (approx. 0.638).

This data suggests that research experience is positively associated with both higher TOEFL scores and a greater likelihood of admission.

Chance of Admit and GRE

These analyses suggest that higher GRE scores are strongly correlated with an increased chance of admission. The likelihood of admission also appears to be influenced by the university rating and is further enhanced by research experience.

Code
library(ggplot2)

ggplot(admission, aes(x = gre_score, y = chance_of_admit)) +
  geom_point(color = "#1f77b4", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(x = "GRE Score", y = "Chance of Admission", 
       title = "Chance of Admit vs GRE Score") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "bottom") +
  guides(color = guide_legend(title = "Type of Fit", 
                              override.aes = list(linetype = c("solid", "dashed"))))

A correlation coefficient of 0.8026105 indicates a strong positive relationship between GRE scores and the chance of admission. This suggests that higher GRE scores are generally associated with a higher likelihood of being admitted.

Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_point() + 
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) +
  labs(x = "GRE Score", y = "Chance of Admission", color = "University Rating", 
       title = "Chance of Admit by GRE Score per University Ranking") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

This trend suggests that applicants to higher-rated universities have a higher chance of admission, with the chance of admission being most favorable at the highest-rated universities.

Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit, color = as.factor(research))) +
  geom_point() + 
  scale_color_manual(values = c("#270181", "coral")) +
  labs(x = "GRE Score", y = "Chance of Admission", color = "Research Experience", 
       title = "Chance of Admit by GRE Score per Research Experience") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

Applicants with research experience (Research = 1) have a higher average GRE score (about 323) compared to those without research experience (Research = 0), who have an average GRE score of approximately 309. Similarly, the mean chance of admission is significantly higher for applicants with research experience (approx. 0.796) than for those without it (approx. 0.638). This indicates that research experience is positively associated with both higher GRE scores and a greater likelihood of admission.

Chance of Admit and CGPA

Code
ggplot(admission, aes(x = cgpa, y = chance_of_admit, color = as.factor(university_rating))) +
  geom_point() + 
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) +
  labs(x = "G.P.A Score", y = "Chance of Admission", color = "University Rating", 
       title = "Chance of Admit by G.P.A per University Ranking") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "bottom") +
  geom_smooth(method = "lm", se = FALSE)

Applicants with a higher GPA have a higher acceptance probability as the university ranking goes from low (1) to high (5). The correlation value of 0.87 indicates a strong positive relationship between GPA and the chance of admission, suggesting that a higher GPA is generally associated with a higher likelihood of being admitted.

Chance of Admission Correlation Heatmap

Code
library(ggplot2)
library(reshape2)
library(viridis)

# Calculate the correlation matrix
data <- cor(admission[sapply(admission, is.numeric)], use = "complete.obs")

# Reshape data for ggplot
data1 <- melt(data)

# Create the heatmap
ggplot(data1, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_viridis(option = "C", direction = -1) +  # Using viridis color scale
  labs(title = "Admission Correlation Heatmap", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),
        axis.text.y = element_text(size = 10),
        plot.title = element_text(color = "black", size = 14, face = "bold"),
        legend.position = "right") +
  geom_text(aes(label = round(value, digits = 2)), color = "white", size = 3)

The heatmap shows that GRE scores, TOEFL scores, and CGPA are strongly and positively correlated with the chance of admission. Research experience also positively influences admission chances, albeit to a lesser extent than academic scores.

Analysis and Results

Given the exploratory data analysis, we will now construct a predictive model for the response variable chance_of_admit.

Code
# Scaling of Data
#admission <- scale(admission)
#admission <- as.data.frame(admission)
Code
# Splitting the Data

#make this example reproducible
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(admission), replace=TRUE, prob=c(0.7,0.3))
train  <- admission[sample, ]
test   <- admission[!sample, ]

#view dimensions of training set
#dim(train)

#view dimensions of test set
#dim(test)

X_train <- train

X_test <- subset(test, select = -c(chance_of_admit))
keeps <- c("chance_of_admit")
y_test <- test[keeps]

Fitting Model 1

Model 1 = chance_of_admit ~ gre_score + toefl_score + university_rating + sop + lor + cgpa + research

Code
gy_logit <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + sop + lor + cgpa + research, data = train)

model_summary <- summary(gy_logit)
# Extracting coefficients for the mean model
coefficients_mean <- model_summary$coefficients$mean
coeff_mean_df <- data.frame(round(coefficients_mean, 4))
colnames(coeff_mean_df) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")

# Extract log-likelihood and pseudo R-squared
log_likelihood <- round(model_summary$loglik, 2)
pseudo_r_squared <- round(model_summary$pseudo.r.squared, 4)


print("Coefficients (Mean Model):")
[1] "Coefficients (Mean Model):"
Code
print(coeff_mean_df)
                  Estimate Std. Error  z value Pr(>|z|)
(Intercept)        -9.7358     0.7785 -12.5066   0.0000
gre_score           0.0084     0.0035   2.4222   0.0154
toefl_score         0.0190     0.0065   2.9333   0.0034
university_rating   0.0481     0.0301   1.5991   0.1098
sop                -0.0577     0.0333  -1.7329   0.0831
lor                 0.1233     0.0358   3.4438   0.0006
cgpa                0.6561     0.0699   9.3840   0.0000
research            0.1499     0.0467   3.2080   0.0013
Code
cat("\nLog-Likelihood:", log_likelihood, "\nPseudo R-squared:", pseudo_r_squared)

Log-Likelihood: 408.75 
Pseudo R-squared: 0.8275
Code
coefficients_df <- data.frame(
  Term = rownames(model_summary$coefficients$mean),
  Estimate = model_summary$coefficients$mean[, "Estimate"],
  Std_Error = model_summary$coefficients$mean[, "Std. Error"]
)
coefficients_df <- coefficients_df[coefficients_df$Term != "(Intercept)",]

Next we use the ‘regsubsets’ function to perform subset selection, aiming to identify the most predictive variables for the ‘chance_of_admit’ outcome. This approach systematically evaluates combinations of up to seven predictors, such as ‘gre_score’, ‘toefl_score’, ‘cgpa’, and others, to determine their impact on the chance of admission. The goal is to find the optimal set of variables that best predict the admission outcome with varying model complexities.

The output indicates which variables are included in the best models at each complexity level, revealing that ‘cgpa’, ‘gre_score’, and ‘lor’ are significant predictors in simpler models. As more variables are added, the model becomes more complex, potentially improving accuracy but also increasing the risk of overfitting. This analysis helps in understanding the relative importance of different academic and profile factors in determining admission chances. In analyzing the predictors and their impact, the greatest factor affecting the probability of admission is CGPA, with a coefficient of approximately 0.66. This is followed by research, with a coefficient of 0.15, and the combination of statement of purpose and letter of recommendation, with a combined coefficient of about 0.07.

Code
library(leaps)
models <- regsubsets(chance_of_admit~. -serial_no, data = admission, nvmax = 9,
                     method = "seqrep")

library(broom)

model_summaries <- summary(models)
num_models <- nrow(model_summaries$which)

# Initialize an empty data frame for the summary
summary_table <- data.frame(Size = integer(), 
                            Variables = character(), 
                            R.Squared = numeric(), 
                            Adj.R.Squared = numeric(), 
                            BIC = numeric(), 
                            stringsAsFactors = FALSE)

# Fill the summary table
for (i in 1:num_models) {
  included_vars <- names(which(model_summaries$which[i, ]))
  vars_str <- paste(included_vars, collapse = ", ")
  summary_table <- rbind(summary_table, 
                         data.frame(Size = i, 
                                    Variables = vars_str, 
                                    R.Squared = model_summaries$rsq[i], 
                                    Adj.R.Squared = model_summaries$adjr2[i], 
                                    BIC = model_summaries$bic[i]))
}

# Print the table
print(summary_table)
  Size
1    1
2    2
3    3
4    4
5    5
6    6
7    7
                                                                         Variables
1                                                                (Intercept), cgpa
2                                              (Intercept), gre_score, toefl_score
3                                                (Intercept), gre_score, lor, cgpa
4                                      (Intercept), gre_score, lor, cgpa, research
5                 (Intercept), gre_score, toefl_score, university_rating, sop, lor
6           (Intercept), gre_score, toefl_score, university_rating, sop, lor, cgpa
7 (Intercept), gre_score, toefl_score, university_rating, sop, lor, cgpa, research
  R.Squared Adj.R.Squared       BIC
1 0.7626339     0.7620375 -563.2776
2 0.6925050     0.6909559 -453.7442
3 0.7941207     0.7925610 -608.2202
4 0.7986651     0.7966263 -611.1569
5 0.7504324     0.7472653 -519.2615
6 0.7987119     0.7956388 -599.2669
7 0.8034714     0.7999619 -602.8472

In this analysis, 10-fold cross-validation is employed to assess the model’s predictive performance by dividing the data into ten subsets and iteratively training the model on nine subsets while testing on the remaining one. The procedure is repeated for models with varying numbers of predictors. Key performance metrics such as RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error) are calculated for each model, showing how the inclusion of additional variables impacts the model’s accuracy. The results show variations in model performance across different numbers of predictors, with certain models achieving lower RMSE and higher R-squared values.

Code
# Set seed for reproducibility
set.seed(123)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(chance_of_admit ~. -serial_no, data = admission,
                    method = "leapSeq", 
                    tuneGrid = data.frame(nvmax = 1:7),
                    trControl = train.control
                    )
step.model$results
  nvmax       RMSE  Rsquared        MAE      RMSESD RsquaredSD       MAESD
1     1 0.06889355 0.7635275 0.05110306 0.009641911 0.06007377 0.007013934
2     2 0.07889825 0.6984116 0.06031014 0.013609817 0.06846210 0.010474096
3     3 0.06452847 0.7965067 0.04693197 0.009982503 0.05017603 0.007105953
4     4 0.06692766 0.7838953 0.04978752 0.015166549 0.06632972 0.012106201
5     5 0.06929365 0.7693051 0.05126744 0.008662209 0.05018380 0.005967808
6     6 0.06422188 0.7988891 0.04638031 0.010252374 0.05087166 0.007242793
7     7 0.06347942 0.8033827 0.04585758 0.010594007 0.05415636 0.007976196

We employ beta regression models to predict the chance of admission, utilizing various combinations of predictors such as GRE scores, TOEFL scores, university ratings, letters of recommendation, CGPA, and research experience. The primary objective is to determine the optimal set of predictors that can accurately forecast admission chances while maintaining model simplicity.

The models, labeled ‘gy_logit1’ to ‘gy_logit7’, were evaluated using the Akaike Information Criterion (AIC) and the pseudo R-squared metric. A lower AIC value indicates a more efficient model in terms of the trade-off between goodness of fit and complexity, while a higher pseudo R-squared value suggests a better model fit.

The outcomes reveal that ‘gy_logit4’ and ‘gy_logit7’ yield the lowest AIC values, indicating they are potentially the most efficient models among those tested.
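For reference, AIC is \(2k - 2\log L\), where \(k\) is the number of estimated parameters. A quick check of this formula against R’s built-in AIC(), using a simple linear model on the bundled cars dataset rather than the admissions data:

```r
# toy model on the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

k <- length(coef(fit)) + 1   # two coefficients plus the residual variance
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))

all.equal(manual_aic, AIC(fit))   # TRUE
```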

Code
gy_logit1 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa , data = train)

gy_logit2 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + research, data = train)

gy_logit3 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + cgpa + research, data = train)

gy_logit4 <- betareg(chance_of_admit ~ gre_score + toefl_score + lor + cgpa + research, data = train)

gy_logit5 <- betareg(chance_of_admit ~ gre_score + university_rating + lor + cgpa + research, data = train)

gy_logit6 <- betareg(chance_of_admit ~ toefl_score + university_rating + lor + cgpa + research, data = train)

gy_logit7 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating + lor + cgpa + research, data = train)

# Creating a data frame with AIC and pseudo R-squared values for each model
model_comparison <- data.frame(
  Model = c("gy_logit1", "gy_logit2", "gy_logit3", "gy_logit4", "gy_logit5", "gy_logit6", "gy_logit7"),
  AIC = c(AIC(gy_logit1), AIC(gy_logit2), AIC(gy_logit3), AIC(gy_logit4), AIC(gy_logit5), AIC(gy_logit6), AIC(gy_logit7)),
  Pseudo_R_Squared = c(gy_logit1$pseudo.r.squared, gy_logit2$pseudo.r.squared, gy_logit3$pseudo.r.squared, gy_logit4$pseudo.r.squared, gy_logit5$pseudo.r.squared, gy_logit6$pseudo.r.squared, gy_logit7$pseudo.r.squared)
)


print(model_comparison)
      Model       AIC Pseudo_R_Squared
1 gy_logit1 -791.6053        0.8198808
2 gy_logit2 -727.0029        0.7706337
3 gy_logit3 -791.7938        0.8229096
4 gy_logit4 -799.3962        0.8250244
5 gy_logit5 -792.9895        0.8202429
6 gy_logit6 -794.4413        0.8228542
7 gy_logit7 -798.5518        0.8262585

Based on these model outcomes, we chose gy_logit4, as it is the simplest model with the lowest AIC.

Diagnostic Measures

Upon selecting the optimal model, we perform diagnostic analyses to inspect its goodness of fit. This is done by assessing a measure of the proportion of variance in the dependent variable explained by the independent variables, and by plotting the residuals of the chosen model. The measure that best captures this relationship is the \(R^2\) value, a statistic that indicates the percentage of the variance in the dependent variable explained by the independent variables collectively. The \(R^2\) formula is below. \[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \] where \(y_i\) represents the observed value, \(\hat{y}_i\) the predicted value, and \(\bar{y}\) the mean of the observed values.
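The formula translates directly into R; a sketch with toy observed and predicted vectors (illustrative numbers only, not values from the admissions model):

```r
# R^2 from observed values y and predictions y_hat
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

y     <- c(0.34, 0.72, 0.80, 0.65, 0.90)   # toy observed proportions
y_hat <- c(0.40, 0.70, 0.78, 0.60, 0.88)   # toy predictions
r_squared(y, y_hat)                        # close to 1 for a good fit
```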

The values of \(R^2\) lie in the interval (0, 1), and the higher the \(R^2\), the better the independent variables explain the variance in the dependent variable. Our analysis indicates that gy_logit4 and gy_logit7 have the highest \(R^2\) values (0.825 and 0.826, respectively). Both are strong candidates, but given our preference for simplicity, gy_logit4 is chosen, as it contains five predictors versus the six in gy_logit7.

Code
model_summary <- summary(gy_logit4)

# Extracting coefficients for the mean model
coefficients_mean <- model_summary$coefficients$mean
coeff_mean_df <- data.frame(round(coefficients_mean, 4))
colnames(coeff_mean_df) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")

# Exponentiating the coefficients for odds ratios
odds_ratios <- exp(coefficients_mean[, "Estimate"])
coeff_mean_df$Odds_Ratio <- round(odds_ratios, 4)


log_likelihood <- round(model_summary$loglik, 2)
pseudo_r_squared <- round(model_summary$pseudo.r.squared, 4)


print("Coefficients (Mean Model):")
[1] "Coefficients (Mean Model):"
Code
print(coeff_mean_df)
            Estimate Std. Error  z value Pr(>|z|) Odds_Ratio
(Intercept)  -9.9650     0.7330 -13.5945   0.0000     0.0000
gre_score     0.0088     0.0035   2.5309   0.0114     1.0089
toefl_score   0.0197     0.0063   3.1268   0.0018     1.0199
lor           0.1082     0.0296   3.6520   0.0003     1.1143
cgpa          0.6590     0.0690   9.5556   0.0000     1.9329
research      0.1458     0.0469   3.1089   0.0019     1.1570
Code
cat("\nLog-Likelihood:", log_likelihood, "\nPseudo R-squared:", pseudo_r_squared)

Log-Likelihood: 406.7 
Pseudo R-squared: 0.825
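To make the Odds_Ratio column concrete: under the logit link, the odds \(\mu/(1-\mu)\) equal \(e^{\eta}\), so a one-unit increase in a predictor multiplies the odds by \(e^{\beta}\). A small sketch using the cgpa coefficient from the table above (the starting linear predictor eta0 is an arbitrary illustrative value):

```r
beta_cgpa <- 0.6590   # cgpa coefficient from the mean model above

# Linear predictors before and after a one-unit increase in CGPA
eta0 <- 0.5           # arbitrary illustrative baseline
eta1 <- eta0 + beta_cgpa

# Odds implied by the logit link: plogis() is the inverse logit
odds <- function(eta) plogis(eta) / (1 - plogis(eta))

# The ratio of odds equals exp(beta), matching the Odds_Ratio column
print(c(odds_ratio = odds(eta1) / odds(eta0), exp_beta = exp(beta_cgpa)))
```

Both entries print as roughly 1.933, the value shown for cgpa in the Odds_Ratio column; the result does not depend on the choice of eta0.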
Code
coefficients_df <- data.frame(
  Term = rownames(model_summary$coefficients$mean),
  Estimate = model_summary$coefficients$mean[, "Estimate"],
  Std_Error = model_summary$coefficients$mean[, "Std. Error"],
  Odds_Ratio = odds_ratios
)


coefficients_df <- coefficients_df[coefficients_df$Term != "(Intercept)",]

ggplot(coefficients_df, aes(x = Term, y = Estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = Estimate - Std_Error, ymax = Estimate + Std_Error), width = 0.2) +
  theme_minimal() +
  coord_flip() + 
  xlab("Model Terms") +
  ylab("Coefficient Estimate") +
  ggtitle("Coefficient Estimates with Standard Errors")

To further analyze the chosen model, we perform visual diagnostics of the residuals. Let’s start by defining what residuals are: a residual is the difference between an observed value and its predicted value. In the plot below, each data point corresponds to one prediction of chance of admit, with the observation index on the x-axis and the standardized residual on the y-axis. The distance from the dotted line at 0 represents the size of the prediction error for that observation. Examining the plot, we observe that the residuals bounce around the zero line in a fairly random manner and do not form any systematic patterns. No individual residual stands out markedly from the others, and the residuals band around the zero mean error line. These are all indicators of a good model.

Code
plot(gy_logit4, which = 1, main = "Standardized Residuals - Model gy_logit4", sub.caption = "")

Residual plots are important because they allow us to evaluate the prediction errors and judge whether the chosen model will provide an acceptable level of accuracy.

Code
baseline <- with(train, data.frame(
  gre_score = mean(gre_score),
  toefl_score = mean(toefl_score),
  lor = mean(lor),
  cgpa = mean(cgpa),
  research = mean(research)
))

# Vary CGPA while keeping other predictors constant
cgpa_range <- seq(min(train$cgpa), max(train$cgpa), length.out = 100)
predictions <- sapply(cgpa_range, function(cgpa) {
  new_data <- baseline
  new_data$cgpa <- cgpa
  predict(gy_logit4, newdata = new_data, type = "response")
})


plot(cgpa_range, predictions, type = 'l', col = 'blue', lwd = 2,
     xlab = 'CGPA', ylab = 'Predicted Chance of Admission',
     main = 'Effect of CGPA on Predicted Chance of Admission')

As the CGPA increases, so does the predicted chance of being admitted. Applicants with higher CGPA scores have a better chance of admission according to our predictive model.
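As a worked example of how these predictions arise, the fitted mean is the inverse logit of the linear predictor. The sketch below uses the gy_logit4 coefficients reported earlier together with a hypothetical applicant profile (the specific GRE, TOEFL, LOR, CGPA, and research values are illustrative assumptions, not taken from the dataset):

```r
# Mean-model coefficients for gy_logit4, as reported above
coefs <- c(intercept = -9.9650, gre_score = 0.0088, toefl_score = 0.0197,
           lor = 0.1082, cgpa = 0.6590, research = 0.1458)

# Hypothetical applicant profile (illustrative values only)
x <- c(1, 320, 110, 4, 9, 1)   # 1 for the intercept, then predictor values

eta <- sum(coefs * x)   # linear predictor on the logit scale
mu  <- plogis(eta)      # inverse logit gives the predicted mean

cat("Predicted chance of admission:", round(mu, 3), "\n")
```

For this strong hypothetical applicant the predicted chance comes out at roughly 0.82, consistent with the upward CGPA trend shown in the plot.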

Conclusion

In concluding our study, the application of beta regression models has provided valuable insights into predicting student admission probabilities. The analysis highlights the significance of various factors, such as GRE scores, TOEFL scores, CGPA, university ratings, letters of recommendation, and research experience. Practically speaking, GRE and TOEFL scores are critical as they are standard measures of a student’s academic readiness and language proficiency, which are crucial for success in higher education settings. CGPA reflects consistent academic performance, while university ratings may correlate with the perceived quality and competitiveness of applicants. Letters of recommendation and research experience offer a qualitative assessment of a student’s capabilities and potential contributions to academic discourse.

The models that combined these predictors with the lowest AIC values and the highest pseudo R-squared scores were deemed most effective. They balance predictive accuracy with model simplicity, avoiding the pitfalls of overfitting associated with more complex models. This balance is vital for practical application, ensuring that the model remains generalizable and relevant to various educational contexts.

Looking forward, there are several avenues for further research and development. One potential area is the exploration of additional variables that could impact admission chances, such as extracurricular activities, personal statements, or socio-economic background. Another aspect worth investigating is the application of these models in different educational contexts and geographical locations to assess their universality and adaptability.

Additionally, further refinement of the models could involve exploring alternative statistical techniques or more complex machine learning algorithms that might capture nonlinear relationships and interactions more effectively. It would also be beneficial to consider the ethical implications of using such predictive models in admission processes, ensuring fairness and diversity in student selection.

In summary, this study not only contributes to the academic understanding of factors influencing university admissions but also offers practical implications for educational institutions and policy-making. By leveraging statistical modeling, universities can gain a more nuanced understanding of the admission process, aiding in the development of more informed and equitable admission policies.

References

Abonazel, Mohamed R., Issam Dawoud, Fuad A. Awwad, and Adewale F. Lukman. 2022. “Dawoud–Kibria Estimator for Beta Regression Model: Simulation and Application.” Frontiers in Applied Mathematics and Statistics 8. https://doi.org/10.3389/fams.2022.775068.
Couri, Lucas, Raydonal Ospina, Geiza da Silva, Víctor Leiva, and Jorge Figueroa-Zúñiga. 2022. “A Study on Computational Algorithms in the Estimation of Parameters for a Class of Beta Regression Models.” Mathematics 10 (3): 299. https://doi.org/10.3390/math10030299.
Cribari-Neto, Francisco, and Achim Zeileis. 2010. “Beta Regression in R.” Journal of Statistical Software 34 (2): 1–24. https://doi.org/10.18637/jss.v034.i02.
Douma, Jacob C., and James T. Weedon. 2019. “Analysing Continuous Proportions in Ecology and Evolution: A Practical Introduction to Beta and Dirichlet Regression.” Methods in Ecology and Evolution 10 (9): 1412–30. https://doi.org/10.1111/2041-210X.13234.
Ferrari, Silvia, and Francisco Cribari-Neto. 2004. “Beta Regression for Modelling Rates and Proportions.” Journal of Applied Statistics 31 (7): 799–815. https://EconPapers.repec.org/RePEc:taf:japsta:v:31:y:2004:i:7:p:799-815.
Guolo, Annamaria, and Cristiano Varin. 2014. “Beta Regression for Time Series Analysis of Bounded Data, with Application to Canada Google® Flu Trends.” The Annals of Applied Statistics 8 (1): 74–88. https://doi.org/10.1214/13-AOAS684.
Ospina, Raydonal, and Silvia L. P. Ferrari. 2012. “A General Class of Zero-or-One Inflated Beta Regression Models.” Computational Statistics & Data Analysis 56 (6): 1609–23. https://doi.org/10.1016/j.csda.2011.10.005.
Espinheira, Patrícia L., Silvia L. P. Ferrari, and Francisco Cribari-Neto. 2008. “On Beta Regression Residuals.” Journal of Applied Statistics 35 (4): 407–19. https://doi.org/10.1080/02664760701834931.
Schmid, Matthias, Florian Wickler, Kelly O. Maloney, Richard Mitchell, Nora Fenske, and Andreas Mayr. 2013. “Boosted Beta Regression.” PLOS ONE 8 (4): 1–15. https://doi.org/10.1371/journal.pone.0061623.
Simas, Alexandre B., Wagner Barreto-Souza, and Andréa V. Rocha. 2010. “Improved Estimators for a General Class of Beta Regression Models.” Computational Statistics & Data Analysis 54 (2): 348–66. https://ideas.repec.org/a/eee/csdana/v54y2010i2p348-366.html.